Introduction

This report analyzes blood samples taken from patients in Wuhan who were diagnosed with COVID-19. The data were collected between 10.01.2020 and 18.02.2020.

The main purpose of this report is to find out which blood components can be used to predict whether a patient will die or recover.

Executive summary

The analyses show that men are less likely to survive a COVID-19 infection: in the analyzed dataset, more than every second man died. The situation is less dire for women, of whom “only” every third did not survive the infection.

Another important factor is age: the report shows that the older the patient, the higher the probability of not surviving the illness.

The report also examines the impact of hospitalization time on fatality. The longer a patient stays in the hospital, the higher the chance of survival; however, after 20 days in the hospital the probability of death rises again.

The analysis also covers the biomarkers suggested in the Tan et al. article, that is, the correlation between LDH, CRP, lymphocyte count and fatality. The correlation indicates that the higher the values of LDH and CRP, the higher the probability of dying, whereas a high number of lymphocytes may suggest a chance of recovery. Finally, a classification model that takes all of the parameters into consideration supports the Tan et al. finding that LDH, CRP and lymphocyte count have an impact on fatality.

Used libraries

  • readxl
  • dplyr
  • tidyr
  • stringr
  • ggplot2
  • plotly
  • corrplot
  • caret

Dataset description

As mentioned in the Introduction, the report analyzes data of patients from a Wuhan hospital. The data consist of records of 375 patients (224 men and 151 women).

Each patient is described by multiple rows. Each row contains some generic properties, such as age, gender and admission time. Additionally, each row represents a separate blood test, whose results populate the appropriate columns. A patient may not have been tested for some properties, in which case these values are missing.

To condense each patient into one row, the researcher calculated the mean of the corresponding results, ignoring missing values.
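The aggregation step described above could look roughly like the following base R sketch. The `PATIENT_ID` column name and all values are assumptions made up for illustration; the helper `mean_or_na` is hypothetical and not taken from the report's code.

```r
# Toy data: two patients with several blood tests each (values are made up).
# PATIENT_ID is an assumed identifier column, not confirmed by the report.
tests <- data.frame(
  PATIENT_ID            = c(1, 1, 1, 2, 2),
  AGE                   = c(73, 73, 73, 61, 61),
  LACTATE_DEHYDROGENASE = c(300, NA, 340, NA, 210),
  LYMPHOCYTE_COUNT      = c(0.8, 1.0, NA, NA, NA)
)

# Mean of the non-missing values; NA if the patient was never tested.
mean_or_na <- function(x) {
  if (all(is.na(x))) NA_real_ else mean(x, na.rm = TRUE)
}

# Collapse the data to one row per patient.
per_patient <- aggregate(
  tests[, c("AGE", "LACTATE_DEHYDROGENASE", "LYMPHOCYTE_COUNT")],
  by  = list(PATIENT_ID = tests$PATIENT_ID),
  FUN = mean_or_na
)
print(per_patient)
```

A patient with no result at all for a given test still ends up with a missing value after this step, which is why a separate imputation step is needed afterwards.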

To handle missing values, the researcher replaced them with the median of the corresponding column. The only exception was made for the X2019.NCOV_NUCLEIC_ACID_DETECTION column, where missing values were replaced with 0.
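A minimal base R sketch of this imputation, assuming the data are already one row per patient. The data frame `patients` and its values are made up for illustration; only the two column names come from the dataset.

```r
# Toy per-patient data with gaps (values are made up).
patients <- data.frame(
  LACTATE_DEHYDROGENASE             = c(320, 210, NA, 500),
  X2019.NCOV_NUCLEIC_ACID_DETECTION = c(-1, NA, 1, NA)
)

# Exception described in the text: missing detection results become 0.
ncov <- "X2019.NCOV_NUCLEIC_ACID_DETECTION"
patients[[ncov]][is.na(patients[[ncov]])] <- 0

# Every other numeric column: replace NA with that column's median.
for (col in setdiff(names(patients), ncov)) {
  if (is.numeric(patients[[col]])) {
    med <- median(patients[[col]], na.rm = TRUE)
    patients[[col]][is.na(patients[[col]])] <- med
  }
}
print(patients)
```

The median is less sensitive than the mean to extreme values, which matters here because several columns have maxima far above their third quartile.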

The following table contains summary statistics of the dataset.

GENDER: character, 375 values
HAS_SURVIVED: logical, FALSE 174, TRUE 201

Variable | Min. | 1st Qu. | Median | Mean | 3rd Qu. | Max.
DAYS_COUNT | 0.00 | 5.00 | 10.00 | 10.86 | 16.00 | 35.00
AGE | 18.00 | 46.00 | 62.00 | 58.83 | 70.00 | 95.00
DISCHARGE_TIME | 2020-01-23 09:09:23 | 2020-02-11 13:39:21 | 2020-02-16 17:40:07 | 2020-02-15 16:42:59 | 2020-02-19 11:47:14 | 2020-03-04 16:21:51
HYPERSENSITIVE_CARDIAC_TROPONINI | 1.9 | 3.7 | 13.0 | 589.3 | 31.9 | 50000.0
HEMOGLOBIN | 61.8 | 114.2 | 125.8 | 125.2 | 137.3 | 178.0
SERUM_CHLORIDE | 74.60 | 99.12 | 101.80 | 102.38 | 104.29 | 140.20
PROTHROMBIN_TIME | 11.50 | 13.57 | 14.31 | 15.53 | 15.90 | 83.35
PROCALCITONIN | 0.0200 | 0.0450 | 0.1000 | 0.7515 | 0.3050 | 29.2400
EOSINOPHILS... | 0.0000 | 0.0000 | 0.2500 | 0.6588 | 0.9000 | 5.5500
INTERLEUKIN_2_RECEPTOR | 65.5 | 615.0 | 693.5 | 832.4 | 780.5 | 7500.0
ALKALINE_PHOSPHATASE | 17.00 | 56.00 | 69.29 | 81.61 | 90.87 | 620.00
ALBUMIN | 18.55 | 29.35 | 33.15 | 33.12 | 36.90 | 46.30
BASOPHIL... | 0.0000 | 0.1000 | 0.2000 | 0.2218 | 0.3000 | 1.7000
INTERLEUKIN_10 | 5.00 | 5.10 | 6.20 | 10.63 | 7.20 | 750.00
TOTAL_BILIRUBIN | 2.75 | 7.50 | 10.59 | 15.09 | 14.45 | 390.85
PLATELET_COUNT | -1.0 | 132.8 | 197.0 | 195.3 | 245.5 | 554.0
MONOCYTES... | 0.500 | 3.767 | 6.483 | 6.612 | 8.858 | 35.200
ANTITHROMBIN | 42.00 | 86.42 | 88.00 | 87.90 | 89.20 | 130.00
INTERLEUKIN_8 | 5.00 | 14.25 | 16.60 | 45.66 | 20.60 | 3409.00
INDIRECT_BILIRUBIN | 1.100 | 4.129 | 5.500 | 6.668 | 7.460 | 102.400
RED_BLOOD_CELL_DISTRIBUTION_WIDTH | 10.70 | 12.03 | 12.54 | 12.97 | 13.50 | 25.60
NEUTROPHILS... | 1.80 | 64.70 | 76.40 | 75.78 | 90.22 | 98.80
TOTAL_PROTEIN | 47.20 | 62.82 | 66.72 | 66.35 | 70.65 | 81.50
QUANTIFICATION_OF_TREPONEMA_PALLIDUM_ANTIBODIES | 0.0200 | 0.0400 | 0.0500 | 0.1108 | 0.0600 | 11.9500
PROTHROMBIN_ACTIVITY | 25.00 | 70.63 | 86.00 | 82.78 | 95.25 | 142.00
HBSAG | 0.000 | 0.000 | 0.010 | 6.182 | 0.010 | 250.000
MEAN_CORPUSCULAR_VOLUME | 61.95 | 87.00 | 89.94 | 89.91 | 92.75 | 116.15
HEMATOCRIT | 17.56 | 34.04 | 36.77 | 36.86 | 39.70 | 52.30
WHITE_BLOOD_CELL_COUNT | 0.716 | 5.484 | 7.780 | 12.882 | 12.895 | 370.930
TUMOR_NECROSIS_FACTOR.U.0391. | 4.000 | 7.950 | 8.425 | 10.263 | 9.100 | 103.550
MEAN_CORPUSCULAR_HEMOGLOBIN_CONCENTRATION | 299.0 | 336.0 | 343.3 | 343.6 | 350.5 | 488.0
FIBRINOGEN | 0.550 | 3.690 | 4.365 | 4.427 | 5.145 | 8.950
INTERLEUKIN_1.U.0392. | 5.000 | 5.000 | 5.000 | 5.903 | 5.000 | 88.500
UREA | 1.700 | 4.000 | 5.434 | 8.458 | 9.700 | 58.100
LYMPHOCYTE_COUNT | 0.0250 | 0.5675 | 0.9137 | 1.0969 | 1.3458 | 43.0550
PH_VALUE | 5.000 | 6.250 | 6.500 | 6.468 | 6.563 | 7.565
RED_BLOOD_CELL_COUNT | 1.850 | 3.927 | 4.437 | 7.747 | 5.330 | 252.253
EOSINOPHIL_COUNT | 0.000000 | 0.001548 | 0.020000 | 0.038627 | 0.060000 | 0.380000
CORRECTED_CALCIUM | 2.070 | 2.270 | 2.360 | 2.346 | 2.430 | 2.790
SERUM_POTASSIUM | 3.130 | 4.042 | 4.337 | 4.401 | 4.626 | 6.860
GLUCOSE | 1.000 | 5.771 | 6.990 | 8.341 | 9.342 | 28.975
NEUTROPHILS_COUNT | 0.320 | 3.337 | 5.473 | 7.280 | 10.408 | 26.795
DIRECT_BILIRUBIN | 1.600 | 3.300 | 4.625 | 8.429 | 6.900 | 288.450
MEAN_PLATELET_VOLUME | 8.50 | 10.22 | 10.80 | 10.88 | 11.40 | 14.20
FERRITIN | 17.8 | 620.9 | 753.0 | 1170.5 | 865.4 | 50000.0
RBC_DISTRIBUTION_WIDTH_SD | 31.30 | 38.70 | 40.62 | 41.87 | 43.55 | 100.10
THROMBIN_TIME | 13.60 | 16.00 | 16.64 | 17.43 | 17.50 | 92.75
X...LYMPHOCYTE | 0.150 | 5.342 | 14.700 | 16.592 | 25.700 | 52.350
HCV_ANTIBODY_QUANTIFICATION | 0.02000 | 0.05000 | 0.06000 | 0.09812 | 0.08000 | 2.09000
D.D_DIMER | 0.210 | 0.570 | 1.345 | 5.771 | 10.694 | 40.500
TOTAL_CHOLESTEROL | 1.004 | 3.023 | 3.636 | 3.667 | 4.187 | 6.160
ASPARTATE_AMINOTRANSFERASE | 7.667 | 21.000 | 29.000 | 47.028 | 41.000 | 959.500
URIC_ACID | 84.2 | 202.9 | 245.7 | 279.3 | 320.4 | 993.0
HCO3. | 10.00 | 21.38 | 23.52 | 22.89 | 25.30 | 30.07
CALCIUM | 1.780 | 2.020 | 2.110 | 2.103 | 2.193 | 2.580
AMINO.TERMINAL_BRAIN_NATRIURETIC_PEPTIDE_PRECURSOR.NT.PROBNP. | 5 | 138 | 318 | 2006 | 815 | 70000
LACTATE_DEHYDROGENASE | 116.0 | 226.2 | 306.4 | 454.2 | 590.7 | 1867.0
PLATELET_LARGE_CELL_RATIO | 11.20 | 26.35 | 30.91 | 31.57 | 35.50 | 58.60
INTERLEUKIN_6 | 1.50 | 16.04 | 21.09 | 71.26 | 27.50 | 5000.00
FIBRIN_DEGRADATION_PRODUCTS | 4.00 | 4.90 | 5.70 | 27.86 | 7.30 | 190.80
MONOCYTES_COUNT | 0.0300 | 0.3200 | 0.4225 | 0.5537 | 0.5358 | 36.9400
PLT_DISTRIBUTION_WIDTH | 8.20 | 11.20 | 12.60 | 12.97 | 14.03 | 24.20
GLOBULIN | 18.50 | 30.42 | 32.98 | 33.19 | 35.97 | 48.20
X.U.0393..GLUTAMYL_TRANSPEPTIDASE | 7.00 | 22.00 | 33.20 | 49.89 | 51.90 | 732.00
INTERNATIONAL_STANDARD_RATIO | 0.840 | 1.030 | 1.100 | 1.227 | 1.260 | 5.450
BASOPHIL_COUNT... | 0.00000 | 0.01000 | 0.01500 | 0.01729 | 0.02000 | 0.10000
X2019.NCOV_NUCLEIC_ACID_DETECTION | -1.0000 | -1.0000 | -1.0000 | -0.1627 | 1.0000 | 1.0000
MEAN_CORPUSCULAR_HEMOGLOBIN | 20.80 | 29.86 | 30.87 | 30.91 | 32.00 | 50.80
ACTIVATION_OF_PARTIAL_THROMBOPLASTIN_TIME | 21.80 | 37.00 | 39.50 | 40.62 | 42.62 | 102.75
HIGH_SENSITIVITY_C.REACTIVE_PROTEIN | 0.10 | 11.75 | 49.00 | 69.16 | 112.80 | 320.00
HIV_ANTIBODY_QUANTIFICATION | 0.05000 | 0.08000 | 0.09000 | 0.09712 | 0.10000 | 0.27000
SERUM_SODIUM | 119.1 | 138.1 | 139.9 | 140.7 | 142.3 | 179.6
THROMBOCYTOCRIT | 0.0100 | 0.1600 | 0.2150 | 0.2144 | 0.2600 | 0.5100
ESR | 1.00 | 18.00 | 30.00 | 32.97 | 40.00 | 106.00
GLUTAMIC.PYRUVIC_TRANSAMINASE | 5.00 | 17.67 | 25.00 | 38.02 | 38.38 | 1061.00
EGFR | 2.15 | 68.92 | 89.38 | 84.31 | 104.20 | 215.45
CREATININE | 12.50 | 59.00 | 75.49 | 104.74 | 95.00 | 1430.00

Statistics

Overall fatality

The following graph compares the number of patients who survived with the number of those who died.
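The overall split can be worked out from the HAS_SURVIVED counts in the dataset summary (FALSE: 174, TRUE: 201). The short base R sketch below only restates that arithmetic; the variable names are chosen for illustration.

```r
# Outcome counts taken from the HAS_SURVIVED column of the dataset summary.
outcome <- c(died = 174, survived = 201)

total <- sum(outcome)                       # number of patients
share <- round(100 * outcome / total, 1)    # percentage per outcome

print(total)
print(share)
```

On these counts, a little under half of the 375 patients in the dataset died.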

Gender based

The following graph compares the number of men and women in the dataset.

The chart below shows mortality by gender.

By Age

The following chart presents a histogram of age, divided by gender.

The following chart shows mortality by age.

The following chart shows mortality by age and gender.

By days in the hospital

The following chart presents the number of days spent in the hospital, by gender.

The following chart shows how the probability of death changes with the number of days spent in the hospital.
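One way to estimate such a probability is to group patients by length of stay and take the share of deaths in each group. The base R sketch below assumes the per-patient columns `DAYS_COUNT` and `HAS_SURVIVED`; the ten rows and the bin edges are made up for illustration and are not the report's data or its actual method.

```r
# Toy per-patient data (values are made up, not taken from the dataset).
stay <- data.frame(
  DAYS_COUNT   = c(2, 4, 7, 9, 12, 15, 18, 22, 25, 30),
  HAS_SURVIVED = c(FALSE, FALSE, TRUE, FALSE, TRUE,
                   TRUE, TRUE, FALSE, TRUE, FALSE)
)

# Group the length of stay into bins (the bin edges are an arbitrary choice).
stay$BIN <- cut(stay$DAYS_COUNT,
                breaks = c(0, 10, 20, Inf),
                labels = c("0-10", "11-20", "21+"),
                include.lowest = TRUE)

# Share of deaths per bin: the mean of a logical vector is a proportion.
death_rate <- tapply(!stay$HAS_SURVIVED, stay$BIN, mean)
print(death_rate)
```

With real data the bins should be wide enough that each one contains a reasonable number of patients; otherwise the estimated proportions become very noisy.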

Correlation between selected attributes

Amount of Deaths/Survivors over time

The following graph shows how the number of survivors/deaths changed over time.

Classification

This chapter focuses on the classification model. A Random Forest classifier was used.

## Confusion Matrix and Statistics
## 
##           Reference
## Prediction ALIVE DEAD
##      ALIVE    50    1
##      DEAD      0   42
##                                           
##                Accuracy : 0.9892          
##                  95% CI : (0.9415, 0.9997)
##     No Information Rate : 0.5376          
##     P-Value [Acc > NIR] : <2e-16          
##                                           
##                   Kappa : 0.9783          
##                                           
##  Mcnemar's Test P-Value : 1               
##                                           
##             Sensitivity : 1.0000          
##             Specificity : 0.9767          
##          Pos Pred Value : 0.9804          
##          Neg Pred Value : 1.0000          
##              Prevalence : 0.5376          
##          Detection Rate : 0.5376          
##    Detection Prevalence : 0.5484          
##       Balanced Accuracy : 0.9884          
##                                           
##        'Positive' Class : ALIVE           
## 
## rf variable importance
## 
##   only 20 most important variables shown (out of 78)
## 
##                                     Overall
## NEUTROPHILS...                       28.893
## LACTATE_DEHYDROGENASE                18.645
## X...LYMPHOCYTE                       16.168
## INTERNATIONAL_STANDARD_RATIO         10.332
## UREA                                  6.543
## PROTHROMBIN_TIME                      6.249
## ALBUMIN                               5.115
## NEUTROPHILS_COUNT                     4.904
## AGE                                   4.512
## LYMPHOCYTE_COUNT                      4.132
## DAYS_COUNT                            2.842
## HIGH_SENSITIVITY_C.REACTIVE_PROTEIN   2.678
## DISCHARGE_TIME                        2.404
## GLUCOSE                               2.332
## PROCALCITONIN                         1.964
## PLATELET_COUNT                        1.655
## EOSINOPHILS...                        1.439
## D.D_DIMER                             1.392
## DIRECT_BILIRUBIN                      1.149
## HYPERSENSITIVE_CARDIAC_TROPONINI      1.112
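
The statistics reported with the confusion matrix above can be re-derived from its four cell counts. The base R sketch below does this by hand, treating ALIVE as the positive class as in the output; the variable names are chosen for illustration and this is not the code that produced the report.

```r
# Cell counts copied from the confusion matrix above
# (rows: prediction, columns: reference).
cm <- matrix(c(50, 0, 1, 42), nrow = 2,
             dimnames = list(Prediction = c("ALIVE", "DEAD"),
                             Reference  = c("ALIVE", "DEAD")))

tp <- cm["ALIVE", "ALIVE"]  # predicted ALIVE, actually ALIVE
fn <- cm["DEAD",  "ALIVE"]  # predicted DEAD,  actually ALIVE
fp <- cm["ALIVE", "DEAD"]   # predicted ALIVE, actually DEAD
tn <- cm["DEAD",  "DEAD"]   # predicted DEAD,  actually DEAD
n  <- sum(cm)

accuracy    <- (tp + tn) / n
sensitivity <- tp / (tp + fn)
specificity <- tn / (tn + fp)
ppv         <- tp / (tp + fp)
npv         <- tn / (tn + fn)
balanced    <- (sensitivity + specificity) / 2

# Cohen's kappa: observed agreement corrected for chance agreement.
expected <- sum(rowSums(cm) * colSums(cm)) / n^2
kappa    <- (accuracy - expected) / (1 - expected)

print(round(c(accuracy = accuracy, kappa = kappa,
              sensitivity = sensitivity, specificity = specificity,
              ppv = ppv, npv = npv, balanced = balanced), 4))
```

The single misclassified case is a patient who died but was predicted to survive, which is why specificity and positive predictive value fall slightly below 1 while sensitivity and negative predictive value do not.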